A Comparison of Evaluation Metrics for a Broad-Coverage Stochastic Parser

Authors

Richard Crouch

Ronald M. Kaplan

Tracy H. King

Stefan Riezler

Abstract

This paper reports on the use of two distinct evaluation metrics for assessing a stochastic parsing model consisting of a broad-coverage Lexical-Functional Grammar (LFG), an efficient constraint-based parser, and a stochastic disambiguation model. The first evaluation metric measures matches of predicate-argument relations in LFG f-structures (henceforth the LFG annotation scheme) to a gold standard of manually annotated f-structures for a subset of the UPenn Wall Street Journal (WSJ) treebank. The other metric maps predicate-argument relations in LFG f-structures to dependency relations (henceforth DR annotations) as proposed by Carroll et al. (1999). For evaluation, these relations are matched against Carroll et al.’s gold standard, which was manually annotated on a subset of the Brown corpus. The parser plus stochastic disambiguator gives an F-measure of 79% (LFG) or 73% (DR) on the WSJ test set. This shows that the two evaluation schemes are similar in spirit, although accuracy is impaired systematically by mapping one annotation scheme to the other. A systematic loss of accuracy is also incurred by corpus variation: training the stochastic disambiguation model on WSJ data and testing on Carroll et al.’s Brown corpus data yields an F-score of 74% (DR) for dependency-relation match. A variant of this measure comparable to the measure reported by Carroll et al. yields an F-measure of 76%. We examine divergences between the annotation schemes, with the aim of improving future methods for assessing parser quality.
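As a rough illustration of how a relation-match F-measure of the kind mentioned in the abstract might be computed, the sketch below scores a parser's predicted relations against gold-standard relations. It is not the authors' evaluation code. The function name `relation_prf`, the `(relation, head, dependent)` tuple representation, the toy data, and the use of the balanced F-measure (equal weight on precision and recall) are all assumptions made for this example.

```python
from collections import Counter


def relation_prf(predicted, gold):
    """Return (precision, recall, f_measure) for two lists of relation tuples.

    Hypothetical helper: relations are compared as exact tuples, and
    repeated relations are counted as a multiset.
    """
    pred_counts = Counter(predicted)
    gold_counts = Counter(gold)
    # Multiset intersection: a relation matches at most as often as it
    # appears in both the prediction and the gold standard.
    matched = sum((pred_counts & gold_counts).values())
    n_pred = sum(pred_counts.values())
    n_gold = sum(gold_counts.values())
    precision = matched / n_pred if n_pred else 0.0
    recall = matched / n_gold if n_gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Toy data (invented for illustration): (relation, head, dependent) triples.
gold = [
    ("subj", "see", "dog"),
    ("obj", "see", "cat"),
    ("det", "dog", "the"),
    ("det", "cat", "a"),
]
pred = [
    ("subj", "see", "dog"),
    ("obj", "see", "cat"),
    ("det", "dog", "the"),
    ("mod", "see", "cat"),  # wrong relation: counts against precision
]

precision, recall, f_measure = relation_prf(pred, gold)
print(f"P={precision:.2f} R={recall:.2f} F={f_measure:.2f}")
```

Under this scheme, mapping one annotation format onto another before scoring changes which tuples can match exactly, which is one way a systematic loss of the kind the abstract describes could arise.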

Similar resources

Evaluating a Wide-Coverage CCG Parser

This paper compares three evaluation metrics for a CCG parser trained and tested on a CCG version of the Penn Treebank. The standard Parseval metrics can be applied to the output of this parser; however, these metrics are problematic for CCG, and a comparison with scores given for standard Penn Treebank parsers is uninformative. As an alternative, we consider two evaluations based on headdepend...

Metrics and Evaluation Tools for Patient Engagement in Healthcare Organization- and System-Level Decision-Making: A Systematic Review

Background: Patient, public, consumer, and community (P2C2) engagement in organization-, community-, and system-level healthcare decision-making is increasing globally, but its formal evaluation remains challenging. To define a taxonomy of possible P2C2 engagement metrics and compare existing evaluation tools against this taxonomy, we conducted a systematic review. Methods: A broad search strate...

Bilexical Dependencies as an Intermedium for Data-Driven and HPSG-Based Parsing

Bilexical dependencies capturing asymmetrical lexical relations between heads and dependents are viewed as a practical representation of syntax that is well-suited for computation and intelligible for human readers. In the present work we use dependency representations as a bridge between data-driven and grammar-based parsing, both for cross-framework parser comparison and for parser integratio...

Chunking + Island-Driven Parsing = Full Parsing

We present a novel method for improving parsing performance, using a stochastic island-driven chart parser preceded by a chunking process for identifying initial islands. Two different stochastic models have been developed for the island-driven parsing. Some experiments with nominal chunking using broad-coverage grammars derived from the Penn Treebank have been performed with remarkable results.

The Power of the TSNLP: Lessons from a Diagnostic Evaluation of a Broad-Coverage Parser

We show a diagnostic evaluation of DIPETT, a broad-coverage parser of English sentences. We consider the TSNLP suite as a diagnostic tool, and propose an alternative broader-coverage test suite of test sentences extracted from Quirk et al. We compare the diagnostic effectiveness of the two suites, and draw a few general conclusions. The evaluation results were used to make significant improveme...

Publication date: 2002
